Skip to content

fix: prevent self-update re-exec loop#16

Merged
TerrifiedBug merged 1 commit intomainfrom
fix/update-loop-v2
Mar 5, 2026
Merged

fix: prevent self-update re-exec loop#16
TerrifiedBug merged 1 commit intomainfrom
fix/update-loop-v2

Conversation

@TerrifiedBug
Copy link
Owner

Summary

  • Skip update when agent is already running the target version — prevents infinite syscall.Exec loop where the new process polls, gets the same pendingAction, and re-execs again before a heartbeat can clear it
  • Shut down all Vector pipelines before syscall.Exec to prevent orphaned processes and port binding conflicts

Context

After a successful update, the agent re-execs via syscall.Exec. The new process starts, polls config immediately, and gets the same pendingAction because no heartbeat was sent to clear it. This creates a tight loop of download → verify → replace → re-exec → repeat every ~1 second.

Test plan

  • Trigger a binary update from the fleet UI
  • Verify agent updates once and stops (no repeated "self-update requested" logs)
  • Verify Vector pipelines restart cleanly without "Address in use" errors
  • Verify the agent correctly reports its version after update

- Skip update when already running target version (breaks re-exec loop)
- Shut down all pipelines before syscall.Exec to prevent port conflicts
@github-actions github-actions bot added fix agent and removed fix labels Mar 5, 2026
@greptile-apps
Copy link

greptile-apps bot commented Mar 5, 2026

Greptile Summary

This PR fixes two related issues in the agent's self-update flow: an infinite syscall.Exec loop caused by the new process immediately re-triggering the same pendingAction, and orphaned Vector child processes left behind during re-exec.

Changes:

  • agent.go: Adds a version guard — if action.TargetVersion == Version, the update is skipped and execution falls through to the next heartbeat, which clears the pendingAction on the server side. This is the primary fix for the re-exec loop.
  • updater.go: Calls supervisor.ShutdownAll() between the atomic binary rename and syscall.Exec, ensuring all Vector processes are gracefully terminated before the new agent image takes over.

Correctness: The version guard is well-placed — it runs before the failedUpdateVersion check and before any I/O, so the fast path is cheap. The ShutdownAll() call runs goroutines in parallel and already has a per-process 30 s SIGTERM → SIGKILL timeout in stopProcess, so the worst-case blocking window is bounded at 30 s regardless of the number of running pipelines.

Confidence Score: 4/5

  • This PR is safe to merge; both changes are minimal, well-scoped, and correctly address the described loop and orphan-process issues.
  • The version-equality check is the correct fix for the re-exec loop, and ShutdownAll() cleanly prevents port conflicts. The one minor concern — ShutdownAll() can hold the pre-exec window open for up to 30 s after the binary is already replaced — is a bounded, acceptable trade-off but deserves a comment for future maintainers.
  • No files require special attention beyond the style note on updater.go line 90.

Important Files Changed

Filename Overview
agent/internal/agent/agent.go Adds an early-exit guard in handlePendingAction that skips the self-update when the agent is already running action.TargetVersion, preventing the post-re-exec polling loop.
agent/internal/agent/updater.go Calls supervisor.ShutdownAll() between the atomic binary rename and syscall.Exec to prevent orphaned Vector processes. ShutdownAll can block up to 30 s (SIGTERM timeout) after the binary is already replaced, which is an acceptable trade-off but worth documenting.

Sequence Diagram

sequenceDiagram
    participant A1 as Agent v1
    participant S as Server
    participant FS as Filesystem
    participant A2 as Agent v2

    A1->>S: pollAndApply()
    S-->>A1: pendingAction {self_update, targetVersion: v2}
    Note over A1: Version check: "v1" ≠ "v2" → proceed
    A1->>FS: Download + verify binary
    A1->>FS: os.Rename(tmp → execPath)
    A1->>A1: supervisor.ShutdownAll() (≤30s)
    A1->>A2: syscall.Exec (process replaced)

    A2->>S: pollAndApply()
    S-->>A2: pendingAction {self_update, targetVersion: v2}
    Note over A2: Version check: "v2" == "v2" → SKIP (loop prevented)
    A2->>S: sendHeartbeat()
    S-->>A2: pendingAction cleared
Loading

Last reviewed commit: b4b3fae

// Shut down all pipelines before re-exec to avoid orphaned Vector processes
// and port conflicts when the new agent starts them again.
slog.Info("stopping all pipelines before re-exec")
a.supervisor.ShutdownAll()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ShutdownAll can block up to 30 seconds post-rename

ShutdownAll() sends SIGTERM to every Vector process and then waits up to 30 seconds for each one (they run in parallel, so worst-case is 30 s total). This happens after the new binary has already been atomically renamed into place at execPath. If the agent process is killed during this window (OOM, stray signal, hardware reset), the new binary sits in place but the agent isn't running — there's nothing to auto-restart it.

The window is bounded and the parallel shutdown keeps it at one timeout cycle regardless of pipeline count, so this is a reasonable trade-off. It's worth a comment calling out the intentional 30 s window, and potentially wrapping ShutdownAll in a context with a hard deadline to make the bound explicit:

// ShutdownAll waits for all Vector processes to exit (up to 30 s each,
// running in parallel) before we re-exec. The new binary is already in
// place at execPath; if this process is killed here the agent will need
// to be restarted manually.
slog.Info("stopping all pipelines before re-exec")
a.supervisor.ShutdownAll()
Prompt To Fix With AI
This is a comment left during a code review.
Path: agent/internal/agent/updater.go
Line: 90

Comment:
**`ShutdownAll` can block up to 30 seconds post-rename**

`ShutdownAll()` sends `SIGTERM` to every Vector process and then waits up to **30 seconds** for each one (they run in parallel, so worst-case is 30 s total). This happens *after* the new binary has already been atomically renamed into place at `execPath`. If the agent process is killed during this window (OOM, stray signal, hardware reset), the new binary sits in place but the agent isn't running — there's nothing to auto-restart it.

The window is bounded and the parallel shutdown keeps it at one timeout cycle regardless of pipeline count, so this is a reasonable trade-off. It's worth a comment calling out the intentional 30 s window, and potentially wrapping `ShutdownAll` in a context with a hard deadline to make the bound explicit:

```go
// ShutdownAll waits for all Vector processes to exit (up to 30 s each,
// running in parallel) before we re-exec. The new binary is already in
// place at execPath; if this process is killed here the agent will need
// to be restarted manually.
slog.Info("stopping all pipelines before re-exec")
a.supervisor.ShutdownAll()
```

How can I resolve this? If you propose a fix, please make it concise.

@TerrifiedBug TerrifiedBug merged commit acc199e into main Mar 5, 2026
12 checks passed
@TerrifiedBug TerrifiedBug deleted the fix/update-loop-v2 branch March 5, 2026 19:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant